Abstract: The problem of neural network association is to
retrieve a previously memorized pattern from its noisy version 
using a network of neurons. An ideal neural network should 
include three components simultaneously: a learning algorithm, 
a large pattern retrieval capacity and resilience against noise. 
Prior works in this area usually improve one or two aspects at 
the cost of the third. 

Our work takes a step forward in closing this gap. More 
specifically, we show that by forcing natural constraints on the 
set of learning patterns, we can drastically improve the retrieval 
capacity of our neural network. Moreover, we devise a learning 
algorithm whose role is to learn those patterns satisfying the
above-mentioned constraints. Finally, we show that our neural
network can cope with a fair amount of noise.

I. Introduction 

Neural networks are famous for their ability to learn and 
reliably perform a required task. An important example is the 
case of (associative) memory where we are asked to memorize 
(learn) a set of given patterns. Later, corrupted versions of the 
memorized patterns will be shown to us and we have to return 
the correct memorized patterns. In essence, this problem is 
very similar to the one faced in communication systems where 
the goal is to reliably transmit and efficiently decode a set of 
patterns (so called codewords) over a noisy channel. 

As one would naturally expect, reliability is a very important issue both in neural associative memories and in communication systems. Indeed, the last three decades have witnessed many reliable artificial associative neural networks; see, for instance, [4], [13], [14], [10], [12], [18].

However, despite common techniques and methods deployed in both fields (e.g., graphical models, iterative algorithms, etc.), there has been a quantitative difference in terms of another important criterion: the efficiency. Over the past decade, by using probabilistic graphical models in communication systems it has become clear that the number of patterns that can be reliably transmitted and efficiently decoded over a noisy channel is exponential in n, the length of the codewords [?]. However, using current neural networks of size n to memorize a set of randomly chosen patterns, the maximum number of patterns that can be reliably memorized scales linearly in n.

There are multiple reasons for the inefficiency of the storage capacity of neural networks. First, neurons can only perform simple operations. As a result, most of the techniques used in communication systems (more specifically in coding theory) for achieving exponential storage capacity are prohibitive in neural networks. Second, a large body of past work (e.g., [4], [13], [14], [10]) followed a common assumption that a neural network should be able to memorize any subset of patterns drawn randomly from the set of all possible vectors of length n. Although this assumption gives the network a sense of generality, it reduces its storage capacity to a great extent.

An interesting question which arises in this context is 
whether one can increase the storage capacity of neural 
networks beyond the current linear scaling and achieve results 
similar to coding theory. To this end, Kumar et al. [2] suggested a new formulation of the problem where only a suitable set of patterns was considered for storing. This way they could show that the performance of neural networks in terms of storage capacity increases significantly. Following the same philosophy, we will focus on memorizing a random subset of patterns of length n such that the dimension of the training set is k < n. In other words, we are interested in
memorizing a set of patterns that have a certain degree of 
structure and redundancy. We exploit this structure both to 
increase the number of patterns that can be memorized (from 
linear to exponential) and to increase the number of errors that 
can be corrected when the network is faced with corrupted 
inputs. 

The success of [2] is mainly due to forming a bipartite
network/graph (as opposed to a complete graph) whose role 
is to enforce the suitable constraints on the patterns, very 
similar to the role played by Tanner graphs in coding. More 
specifically, one layer is used to feed the patterns to the 
network (so called variable nodes in coding) and the other 
takes into account the inherent structure of the input patterns 
(so called check nodes in coding). A natural way to enforce 
structures on inputs is to assume that the connectivity matrix 
of the bipartite graph is orthogonal to all of the input patterns. 
However, the authors in [2] heavily rely on the fact that the
bipartite graph is fully known and given, and satisfies some
sparsity and expansion properties. The expansion assumption
is made to ensure that the resulting set of patterns is resilient
against a fair amount of noise. Unfortunately, no algorithm for
finding such a bipartite graph was proposed. 

Our main contribution in this paper is to relax the above assumptions while achieving better error correction performance. More specifically, we first propose an iterative algorithm that can find a sparse bipartite graph that satisfies the desired set of constraints. We also provide an upper bound on the block error rate of the method that deploys this learning strategy. We then proceed to devise a multi-layer network whose performance in terms of error tolerance improves significantly upon [2] and whose connectivity graph no longer needs to be an expander.

The remainder of this paper is organized as follows. In Section II we formally state the problem that is the focus of this work, namely neural association for a network of non-binary neurons. We then provide an overview of the related work in this area in Section II-A. We present our pattern learning algorithm in Section III and the multi-level network design in Section IV. The simulations supporting our analytical results are shown in Section V. Finally, future works are explained in Section VI.



II. Problem Formulation 

In contrast to the mainstream work in neural associative memories, we focus on non-binary neurons, i.e., neurons that can assume a finite set of integer values $\mathcal{S} = \{0, 1, \dots, S-1\}$ for their states (where $S > 2$). A natural way to interpret the multi-level states is to think of the short-term (normalized) firing rate of a neuron as its output. Neurons can only perform simple operations. In particular, we restrict the operations at each neuron to a linear summation over the inputs, and a possibly non-linear thresholding operation. More precisely, a neuron $x$ updates its state based on the states $s_1, \dots, s_n$ of its neighbors as follows:

1) It computes the weighted sum $h = \sum_{i=1}^{n} w_i s_i$, where $w_i$ denotes the weight of the input link from $s_i$.

2) It updates its state as $x = f(h)$, where $f : \mathbb{R} \to \mathcal{S}$ is a possibly non-linear function from the field of real numbers $\mathbb{R}$ to $\mathcal{S}$.
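
The two-step update above can be sketched in a few lines of Python. This is only an illustration: the paper leaves the function f unspecified, so the choice made here (round to the nearest integer, then clip to {0, ..., S-1}) is an assumption, and the name `neuron_update` is hypothetical.

```python
import numpy as np

def neuron_update(weights, neighbor_states, S):
    """Return the new state of one neuron with states in {0, ..., S-1}."""
    # Step 1: weighted sum h of the neighbors' states.
    h = float(np.dot(weights, neighbor_states))
    # Step 2: x = f(h). Assumed f: round to the nearest integer, then clip.
    return int(np.clip(np.rint(h), 0, S - 1))

# Hypothetical example: three neighbors, S = 4 possible states.
new_state = neuron_update(np.array([0.5, 1.0, -0.25]), np.array([2, 1, 4]), S=4)
```

Any other f that maps the real line into the state set would fit the same template; only the last line of the function would change.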

Neural associative memory aims to memorize C patterns of length n by determining the weighted connectivity matrix of the neural network (learning phase) such that the given patterns are stable states of the network. Furthermore, the network should be able to tolerate a fair amount of noise so that it can return the correct memorized pattern in response to a corrupted query (recall phase). Among the networks with these two abilities, the one with the largest C is the most desirable.

We first focus on learning the connectivity matrix of a neural 
graph which memorizes a set of patterns having some inherent 
redundancy. More specifically, we assume to have C vectors 
of length n with non-negative integer entries, where these 
patterns form a subspace of dimension k < n. We would 
like to memorize these patterns by finding a set of non-zero
vectors $w_1, \dots, w_m \in \mathbb{R}^n$ that are orthogonal to the set of
given patterns. Furthermore, we are interested in rather sparse
vectors. Putting the training patterns in a matrix $X_{C \times n}$ and
focusing on one such vector $w$, we can formulate the problem
as:

$$\min_{w} \; \|X \cdot w\|_2 \tag{1a}$$

$$\text{subject to} \quad \|w\|_0 \le q \quad \text{and} \quad \|w\|_2^2 \ge \epsilon \tag{1b}$$

where $q \in \mathbb{N}$ determines the degree of sparsity and $\epsilon \in \mathbb{R}^+$ prevents the all-zero solution. A solution to the above problem yields a sparse bipartite graph which corresponds to the basis vectors of the null space specified by the patterns in the training set. In other words, the inherent structure of the patterns is captured in terms of $m$ linear constraints on the entries of the patterns $x^\mu$ in the training set. It can therefore be described by Figure 1 with a connectivity matrix $W \in \mathbb{R}^{m \times n}$ and a vector $b \in \mathbb{R}^m$ such that $W x^\mu = b$ for all $\mu = 1, \dots, C$. In this paper, we assume $b = 0$ for simplicity, where $0$ is the all-zero vector.

Fig. 1. A bipartite graph that represents the constraints on the training set.

In the recall phase, the neural network is fed with noisy inputs. A possibly noisy version of an input pattern is initialized as the states of the pattern neurons $x_1, x_2, \dots, x_n$. Here, we assume that the noise is integer valued and additive.¹ In formula, we have $y = W(x^\mu + z) = Wz$, where $z$ is the noise added to pattern $x^\mu$ and we used the fact that $W x^\mu = 0$. Therefore, one can use $y = Wz$ to eliminate the input noise $z$. Consequently, we are searching for an algorithm that can provably eliminate the effect of noise and return the correct pattern.

¹It must be mentioned that neural states below 0 and above S will be set to 0 and S, respectively.

Remark 1. A solution in the learning/recall phase is acceptable only if it can be found by simple operations at neurons.

Before presenting our solution, we briefly overview the relation between the previous works and the one presented in this paper.

A. Related Works

Designing a neural network capable of learning a set of patterns and recalling them later in the presence of noise has been an active topic of research for the past three decades. Inspired by the Hebbian learning rule [8], Hopfield in his seminal work [4] introduced the Hopfield network: an auto-associative neural mechanism of size n with binary state neurons in which patterns are assumed to be binary vectors of length n. The capacity of a Hopfield network under vanishing bit error probability was later shown to be 0.13n by Amit et al. [6]. Later on, McEliece et al. proved that the capacity of Hopfield networks under the vanishing block error probability requirement is $O(n/\log(n))$ [11]. Similar results were obtained for sparse regular neural networks in [?]. It is also known that the capacity of neural associative memories could be enhanced if the patterns are sparse in the sense that at any time instant many of the neurons are silent [?]. However, even these schemes fail when required to correct a fair amount of erroneous bits, as the information retrieval is not better compared to that of normal networks.

In addition to neural networks capable of learning patterns gradually, in [13], the authors calculate the weight matrix offline (as opposed to gradual learning) using the pseudo-inverse rule [?], which in turn helps them improve the capacity of a Hopfield network to n/2 random patterns with the ability of one bit error correction.

Due to the low capacity of Hopfield networks, extension of 
associative memories to non-binary neural models has also 
been explored in the past. Hopfield addressed the case of 
continuous neurons and showed that similar to the binary 
case, neurons with states between —1 and 1 can memorize 
a set of random patterns, albeit with less capacity [5]. In [14]
the authors investigated multi-state complex-valued neural
associative memories for which the estimated capacity is C <
0.15n. Under the same model but using a different learning
method, Muezzinoglu et al. [10] showed that the capacity
can be increased to C = n. However, the complexity of the
weight computation mechanism is prohibitive. To overcome
this drawback, a Modified Gradient Descent learning Rule
(MGDR) was devised in [?].

Given that even very complex offline learning methods cannot improve the capacity of binary or multi-state Hopfield networks, a line of recent work has made considerable efforts to exploit the inherent structure of the patterns in order to increase both capacity and error correction capabilities. Such methods either make use of higher order correlations of patterns or focus merely on those patterns that have some sort of redundancy. As a result, they differ from previous methods for which every possible random set of patterns was considered. Pioneering this prospect, Berrou and Gripon [18] achieved considerable improvements in the pattern retrieval capacity of Hopfield networks by utilizing Walsh-Hadamard sequences. This improvement was made by paying the price of using a decoder based on a winner-take-all approach, which requires a separate neural network. Therefore, this approach increases the complexity of the overall method. Using low correlation sequences has also been considered in [?], which results in increasing the storage capacity of Hopfield networks to n without requiring any separate decoding stage.

In contrast to the pairwise correlation of the Hopfield model [4], Peretto et al. [17] deployed higher order neural models: the state of the neurons not only depends on the state of their neighbors, but also on the correlation among them. Under this model, they showed that the storage capacity of a higher-order Hopfield network can be improved to $C = O(n^{p-2})$, where p is the degree of correlation considered. The main drawback of this model was again the huge computational complexity required in the learning phase. To address this difficulty while being able to capture higher-order correlations, a bipartite graph inspired from iterative coding theory was introduced in [2]. Under the assumptions that the bipartite graph is known, sparse, and expander, the proposed algorithm increased the pattern retrieval capacity to $C = O(a^n)$, for some $a > 1$. The main drawbacks in the proposed approach are the lack of a learning algorithm as well as the assumption that the weight matrix should be an expander. The sparsity criterion on the other hand, as it was noted by the authors, is necessary in the recall phase and biologically more meaningful.

In this paper, we focus on solving the above two problems in [2]. We start by proposing an iterative learning algorithm that identifies a sparse weight matrix W. The weight matrix W should satisfy a set of linear constraints $W x^\mu = 0$ for all the patterns in the training data set, where $\mu = 1, \dots, C$. We then propose a novel network architecture which eliminates the need for the expansion criteria while achieving better performance than the error correction algorithm proposed in [2].

Learning linear constraints by a neural network is hardly 
a new topic as one can learn a matrix orthogonal to a set of 
patterns in the training set (i.e., $W x^\mu = 0$) using simple neural
learning rules (we refer the interested readers to [3] and [16]).
However, to the best of our knowledge, finding such a matrix
subject to the sparsity constraints has not been investigated
before. This problem can also be regarded as an instance of
compressed sensing [21], in which the measurement matrix
is given by the big patterns matrix $X_{C \times n}$ and the set of
measurements are the constraints we look to satisfy, denoted 
by the tall vector b, which for simplicity reasons we assume to 
be all zero. Thus, we are interested in finding a sparse vector 
w such that Xw = 0. 

Nevertheless, many decoders proposed in this area are very 
complicated and cannot be implemented by a neural network 
using simple neuron operations. Some exceptions are [1] and
[19], from which we derive our learning algorithm.



III. Learning Algorithm 

We are interested in an iterative algorithm that is simple 
enough to be implemented by a network of neurons. Therefore, 
we first relax (1) as follows:

$$\min_{w} \; \|X \cdot w\|_2 - \lambda\left(\|w\|_2^2 - \epsilon\right) + \gamma\left(g(w) - q'\right). \tag{2}$$

In the above problem, we have approximated the constraint $\|w\|_0 \le q$ with $g(w) \le q'$ since $\|\cdot\|_0$ is not a well-behaved function. The function $g(w)$ is chosen such that it favors sparsity. For instance one can pick $g(w)$ to be $\|\cdot\|_1$, which leads to $\ell_1$-norm minimizations. In this paper, we consider the function

$$g(w) = \sum_{i=1}^{n} \tanh(\sigma w_i^2)$$

where $\sigma$ is chosen appropriately. By calculating the derivative of the objective function and primal-dual optimization techniques we obtain the following iterative algorithm for (2):
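
$$y(t) = \frac{X w(t)}{\|X\|_2} \tag{3a}$$

$$w(t+1) = (1 + 2\lambda_t)\, w(t) - 2\alpha_t \frac{X^T y(t)}{\|X\|_2} - \gamma_t \nabla g(w(t)) \tag{3b}$$

$$\lambda_{t+1} = \left[\lambda_t + \delta\left(\epsilon - \|w\|_2^2\right)\right]_+ \tag{3c}$$

$$\gamma_{t+1} = \left[\gamma_t + \delta\left(g(w) - q'\right)\right]_+ \tag{3d}$$

where $t$ denotes the iteration number, $X^T$ is the transpose of matrix $X$, $\delta$ and $\alpha_t$ are small step sizes and $[\cdot]_+$ denotes $\max(\cdot, 0)$.

Algorithm 1 Iterative Learning
Input: pattern matrix $X$, stopping point $p$.
Output: $w$
while $\|y(t)\|_{\max} > p$ do
    Compute $y(t) = X w(t) / \|X\|_2$.
    Update $w(t+1) = \eta\left((1 + 2\lambda_t)\, w(t) - 2\alpha_t X^T y(t) / \|X\|_2 ;\ \theta_t\right)$.
    Update $\lambda_{t+1} = \left[\lambda_t + \delta\left(\epsilon - \|w\|_2^2\right)\right]_+$.
    $t \leftarrow t + 1$.
end while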

For our choice of $g(w)$, the $i$-th entry of the function $f(w) = \nabla g(w)$, denoted by $f_i(w)$, reduces to $2\sigma w_i \left(1 - \tanh(\sigma w_i^2)^2\right)$. For very small values of $w_i$, $f_i(w) \simeq 2\sigma w_i$, i.e., it is proportional to $w_i$, and for large values of $w_i$, $f_i(w) \simeq 0$. Therefore, by looking at (3b) we see that the last term is pushing small values in $w(t+1)$ towards zero while leaving the larger values intact. Therefore, we remove the last term completely and enforce small entries to zero in each update, which in turn enforces sparsity. The final iterative learning procedure is shown in Algorithm 1.
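
The behavior of the gradient of the tanh-based sparsity penalty can be checked numerically. The sketch below is only an illustration: the value of sigma and the test vector are arbitrary assumptions, and the helper names `g` and `grad_g` are hypothetical.

```python
import numpy as np

def g(w, sigma):
    """Sparsity penalty g(w) = sum_i tanh(sigma * w_i^2)."""
    return np.sum(np.tanh(sigma * w**2))

def grad_g(w, sigma):
    """Entry-wise gradient f_i(w) = 2*sigma*w_i*(1 - tanh(sigma*w_i^2)^2)."""
    return 2.0 * sigma * w * (1.0 - np.tanh(sigma * w**2) ** 2)

sigma = 5.0                                  # assumed value, not from the paper
w = np.array([1e-3, 0.05, 0.5, 3.0])         # from very small to large entries
f = grad_g(w, sigma)

# Central finite difference on the second entry as a sanity check.
step = 1e-6
e1 = np.zeros_like(w)
e1[1] = step
numeric = (g(w + e1, sigma) - g(w - e1, sigma)) / (2 * step)
```

For the smallest entry the gradient is essentially linear in the entry, while for the largest entry it is numerically zero, which is the property used to justify replacing the penalty term by a thresholding step.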

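Here, $\theta_t$ is a positive threshold at iteration $t$ and $\eta(\cdot\,; \theta_t)$ is the point-wise soft-thresholding function given below:

$$\eta(u; \theta_t) = \begin{cases} u & \text{if } u > \theta_t, \\ u & \text{if } u < -\theta_t, \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$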



Remark 2. The above choice of soft-thresholding function is
very similar to the one selected by Donoho et al. in [1] in
order to recover a sparse signal from a set of measurements.
The authors prove that their choice of soft-threshold function
results in optimal sparsity-undersampling trade-off.
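
A minimal NumPy sketch of Algorithm 1 is given below for illustration. It is not the authors' implementation: the constant step size, the threshold schedule `0.25 / t`, the values of `p`, `eps` and `delta`, the explicit initial vector `w0`, and the function names are all assumptions made for this example.

```python
import numpy as np

def threshold(u, theta):
    """Point-wise threshold of (4): keep entries with |u| > theta, zero the rest."""
    return np.where(np.abs(u) > theta, u, 0.0)

def learn_constraint(X, w0, p=1e-3, eps=0.01, delta=0.01, alpha=0.45, max_iter=10000):
    """Sketch of Algorithm 1: look for a sparse w with X @ w close to zero."""
    norm_X = np.linalg.norm(X, 2)            # spectral norm ||X||_2
    w = np.array(w0, dtype=float)
    lam = 0.0
    for t in range(1, max_iter + 1):
        y = X @ w / norm_X                   # y(t) = X w(t) / ||X||_2
        if np.max(np.abs(y)) <= p:           # stop once ||y(t)||_max <= p
            break
        theta = 0.25 / t                     # assumed decaying threshold
        w = threshold((1 + 2 * lam) * w - 2 * alpha * (X.T @ y) / norm_X, theta)
        lam = max(lam + delta * (eps - w @ w), 0.0)   # multiplier update
    return w

# Two toy patterns spanning a 2-dimensional subspace of R^4.
X = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
w = learn_constraint(X, w0=[1.0, 0.0, 0.0, 0.0])
```

On this toy input the iterate approaches a multiple of (1, -1, 0, 0), a sparse vector orthogonal to both patterns. Running the routine from several initial vectors would give the rows of a constraint matrix W.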

The next theorem derives the necessary conditions on $\alpha_t$, $\lambda_t$ and $\theta_t$ such that Algorithm 1 converges to a sparse solution.

Theorem 1. If $\theta_t \to 0$ as $t \to \infty$ and if $\lambda_t$ is bounded above by $a_{\min}/(a_{\max} - a_{\min})$, then there is a proper choice of $\alpha_t$ in every iteration $t$ that ensures constant decrease in the objective function $\|X \cdot w(t)\|_{\max}$. Here $a_{\min} = \min_\mu \|x^\mu\|_2^2 / \|X\|_2^2$ and $a_{\max} = \max_\mu \|x^\mu\|_2^2 / \|X\|_2^2$. For $\lambda_t = 0$, i.e. $\|w(t)\|_2^2 \ge \epsilon$, picking $0 < \alpha_t < 1$ ensures gradual convergence.

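Sketch of the proof: Let $E(t) = \|y(t)\|_{\max}$. We would like to show that $E(t+1) < E(t)$ for all iterations $t$. To this end, let us denote $(1 + 2\lambda_t)\, w(t) - 2\alpha_t X^T y(t) / \|X\|_2$ by $w'(t)$. Furthermore, let the function $\chi(u; \theta_t)$ be $u - \eta(u; \theta_t)$. Rewriting the second step of Algorithm 1 we will have:

$$w(t+1) = w'(t) - \chi(w'(t); \theta_t).$$

Now we have

$$E(t+1) = \frac{\|X w(t+1)\|_{\max}}{\|X\|_2} \le \frac{\|X w'(t)\|_{\max}}{\|X\|_2} + \frac{\|X \chi(w'(t); \theta_t)\|_{\max}}{\|X\|_2}, \tag{6}$$

where the inequality is the triangle inequality. By (4), every entry of $\chi(w'(t); \theta_t)$ has magnitude at most $\theta_t$, so the second term vanishes as $\theta_t \to 0$. Now expanding $X w'(t)$ we will get

$$\frac{X w'(t)}{\|X\|_2} = (1 + 2\lambda_t)\, y(t) - 2\alpha_t \frac{X X^T}{\|X\|_2^2}\, y(t) = \left( (1 + 2\lambda_t) I_{C \times C} - 2\alpha_t \frac{X X^T}{\|X\|_2^2} \right) y(t).$$

Denoting the matrix $(1 + 2\lambda_t) I_{C \times C} - 2\alpha_t X X^T / \|X\|_2^2$ by $D_t$, we can further simplify inequality (6):

$$E(t+1) \le \|D_t\, y(t)\|_{\max} + \frac{\|X \chi(w'(t); \theta_t)\|_{\max}}{\|X\|_2} \le \|D_t\|_{\max} \|y(t)\|_{\max} + \frac{\|X \chi(w'(t); \theta_t)\|_{\max}}{\|X\|_2} = \|D_t\|_{\max} E(t) + \frac{\|X \chi(w'(t); \theta_t)\|_{\max}}{\|X\|_2}.$$

Therefore, if we set $\theta_t = 1/t$ (i.e. $\theta_t \to 0$ as $t \to \infty$) and ensure $\|D_t\|_{\max} < 1$, we get $E(t+1) < E(t)$. The second requirement requires that $|D_{ij}(t)| < 1$ for all elements of $D_t$. Therefore, by letting $A = X X^T$ (so that $\|A\|_2 = \|X\|_2^2$) we must have the following relationship for diagonal elements:

$$\left| 1 + 2\lambda_t - 2\alpha_t \frac{A_{ii}}{\|A\|_2} \right| < 1 \tag{9}$$

which yields

$$\lambda_t < \alpha_t \frac{A_{ii}}{\|A\|_2} < 1 + \lambda_t, \qquad \forall i = 1, \dots, C.$$

Since $\lambda_t \ge 0$ for all $t$ and $0 < A_{ii} \le \|A\|_2$, the right-hand side of the above inequality is satisfied if $\alpha_t < 1$. The left-hand side is satisfied for $\alpha_t > \lambda_t / a_{\min}$, where $a_{\min} = \min_i A_{ii} / \|A\|_2$. Therefore, if $\lambda_t < a_{\min}/(a_{\max} - a_{\min})$ there exists an $\alpha_t$ that satisfies (9). If $\lambda_t = 0$, this is simply equivalent to having $0 < \alpha_t < 1$.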
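
The step-size condition of Theorem 1 can be evaluated numerically for a given pattern matrix. The sketch below computes, under the diagonal condition only, the open interval of admissible step sizes and the bound on the multiplier. The function names and the small pattern matrix are assumptions made for this illustration.

```python
import numpy as np

def normalized_diagonal(X):
    """Return a_i = A_ii / ||A||_2 with A = X X^T."""
    A = X @ X.T
    return np.diag(A) / np.linalg.norm(A, 2)

def step_size_interval(X, lam):
    """Interval (lower, upper) of alpha with lam < alpha * a_i < 1 + lam for all i."""
    a = normalized_diagonal(X)
    return lam / a.min(), (1.0 + lam) / a.max()

def max_lambda(X):
    """Bound a_min / (a_max - a_min) on the multiplier from Theorem 1."""
    a = normalized_diagonal(X)
    gap = a.max() - a.min()
    return np.inf if gap == 0 else a.min() / gap

# Hypothetical pattern matrix with two patterns of different norms.
X = np.array([[1, 1, 0],
              [0, 1, 2]], dtype=float)
lo, hi = step_size_interval(X, lam=0.5)      # non-empty, since 0.5 < max_lambda(X)
lo_bad, hi_bad = step_size_interval(X, lam=0.7)
```

When the multiplier exceeds the bound, the lower end of the interval passes the upper end and no step size satisfies the diagonal condition, which is what the second call illustrates.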



IV. Multi-level Network Architecture 

In the previous section, we discussed the details of a simple iterative learning algorithm which yields rather sparse graphs. Now in the recall phase, we propose a network structure together with a simple error correction algorithm (similar to the one in [2]) to achieve good block error rates in response to noisy input patterns. The suggested network architecture is shown in Figure 2. To make the description clear and simple we only concentrate on a two-level neural network. However, the generalization of this idea is trivial and left to the reader.

Fig. 2. A two-level error correcting neural network.



Algorithm 2 Error Correction
Input: pattern matrix $X$, threshold $\varphi$, iteration $t_{\max}$
Output: $x_1, x_2, \dots, x_n$
1: for $t = 1 \to t_{\max}$ do
2:     Forward iteration: Calculate the weighted input sum $h_i = \sum_{j=1}^{n} W_{ij} x_j$ for each neuron $y_i$ and set:
           $y_i = 1$ if $h_i < 0$; $y_i = 0$ if $h_i = 0$; $y_i = -1$ otherwise.²
3:     Backward iteration: Each neuron $x_j$ computes $g_j = \frac{\sum_{i=1}^{m} W_{ij} y_i}{\sum_{i=1}^{m} |W_{ij}|}$.
4:     Update the state of each pattern neuron $j$ according to $x_j = x_j + \mathrm{sgn}(g_j)$ only if $|g_j| > \varphi$.
5:     $t \leftarrow t + 1$
6: end for

Note that in practice, we replace the conditions $h_i = 0$ and $h_i > 0$ with $|h_i| < \epsilon$ and $h_i > \epsilon$ for some small positive number $\epsilon$.
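
The error correction loop can be sketched in NumPy as follows. This is an illustration under stated assumptions: the routine takes the connectivity matrix W and a noisy pattern as inputs, the feedback is taken to be the normalized sum of W_ij y_i over the constraint neurons (the form used in the proof of Theorem 3), and the small cyclic matrix W below is a hypothetical example rather than a learned one.

```python
import numpy as np

def error_correct(W, x, phi=0.8, t_max=20, eps=1e-2):
    """Sketch of Algorithm 2 for one network with connectivity matrix W."""
    x = np.array(x, dtype=float)
    col_l1 = np.sum(np.abs(W), axis=0)
    col_l1[col_l1 == 0] = 1.0                # guard for unconnected pattern neurons
    for _ in range(t_max):
        h = W @ x                            # forward iteration
        y = np.where(h > eps, -1.0, np.where(h < -eps, 1.0, 0.0))
        if not y.any():                      # all constraints satisfied
            break
        g = (W.T @ y) / col_l1               # backward iteration (feedback)
        update = np.abs(g) > phi
        x[update] += np.sign(g[update])
    return x

# Hypothetical constraints: neighboring entries must be equal (cyclically).
W = np.array([[ 1, -1,  0,  0],
              [ 0,  1, -1,  0],
              [ 0,  0,  1, -1],
              [-1,  0,  0,  1]], dtype=float)

recovered_plus = error_correct(W, [3, 2, 2, 2])    # +1 noise on the first neuron
recovered_minus = error_correct(W, [2, 1, 2, 2])   # -1 noise on the second neuron
```

In both calls only the corrupted neuron receives feedback of magnitude 1, while its neighbors receive magnitude 0.5, so a threshold of 0.8 changes the corrupted entry alone.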



The proposed approach is in contrast to the one suggested in [2], where the authors exploit a single-level neural network with a sparse and expander connectivity graph to correct at least two initial input errors. However, enforcing expansion on connectivity graphs in a gradual neural learning algorithm is extremely difficult, especially when the algorithm is required to be very simple. Therefore, we use the learning algorithm explained above, which yields a rather sparse and not necessarily expander graph, and improve the error correction capabilities by modifying the network structure and error correcting algorithm.

The idea behind this new architecture is that we divide the input pattern of size n into L sub-patterns of length n/L. Now we feed each sub-pattern to a neural network which enforces m constraints³ on the sub-pattern in order to correct the input errors. The local networks in the first level and the global network in the second level use Algorithm 2, which is a variant of the "bit-flipping" method proposed in [2], to correct the errors. Intuitively, if the states of the pattern neurons $x_i$ correspond to a pattern from X (i.e., the noise-free case), then for all $i = 1, \dots, m$ we have $y_i = 0$. The quantity $g_j$ can be interpreted as feedback to pattern neuron $x_j$ from the constraint neurons. Hence, the sign of $g_j$ provides an indication of the sign of the noise that affects $x_j$, and $|g_j|$ indicates the confidence level in the decision regarding the sign of the noise.

²Note that although we do not allow neurons to have negative outputs, the set of outputs {-1, 0, 1} can be easily implemented by sending {0, 1, 2} and shifting the response in the pattern neurons in each iteration. The shifting can be done by modifying the firing threshold for each neuron.

³The number of constraints for different networks can vary. For simplicity of notation we assume equal sizes.

Theorem 2. The neural network given in Figure 2 can memorize an exponential number of patterns of length n with non-negative integer entries.

Proof: The proposed neural network can potentially memorize any pattern that has the following two properties:

1) The overall pattern of length n lies in the subspace of dimension $k_t$ determined by the set of null basis vectors in the matrix $W_g$.

2) Each sub-pattern i of length n/L lies in the subspace determined by the set of null basis vectors $W_i$, $i = 1, \dots, L$.

The first requirement means that we use $m_g$ out of n degrees of freedom for a pattern of length n. Spreading the remaining $n - m_g = k_t$ degrees of freedom over the sub-pattern clusters, there is at least one cluster which will have at least $k_t/L$ degrees of freedom. Now if $k < k_t/L$ in this cluster,
there are at least 5^*/^^'^ patterns that lie in this particular 
subspace. Therefore, assuming $k$ and $k_t$ to be linear in $n$, the
total number of patterns that can be memorized is exponential
in $n$.

Note that the above lines of analysis provide a very loose 
lower bound on the number of patterns that can be memorized 
since the linear constraints enforced on different sub-patterns 
do not need to be completely independent of each other.
Therefore, there is no need to spend all the degrees of freedom 
available on the patterns which increases the pattern retrieval 
capacity substantially. We exploit this fact in our simulations 
by generating sub-patterns randomly. 

■ 

Theorem 3. Algorithm 2 can correct a single error in the input pattern with high probability if $\varphi$ is chosen large enough.

Proof: If only a single neuron is corrupted then only constraints connected to this node will be violated in the forward iteration. For simplicity, let us assume it is the first neuron which is corrupted with noise and the sign of noise is positive. Therefore, the input sums have the signs of $w_1$ and the forward iteration of Algorithm 2 gives $y = -\mathrm{sign}(w_1)$, where $w_1$ denotes the first column of the matrix W. Now in the backward iteration, $g_1 = \frac{\langle y, w_1 \rangle}{\|w_1\|_1} = -1$, where $\langle \cdot, \cdot \rangle$ denotes the inner product. Therefore, $|g_1| > \varphi$ and pattern neuron 1 will be updated correctly, i.e., decreased by one. The other nodes on the other hand receive $g_j = \frac{\langle y, w_j \rangle}{\|w_j\|_1}$. Given that during the learning phase the weight vectors and their signs are determined independently of each other, the probability of having $\mathrm{sign}(w_1) = \mathrm{sign}(w_j)$ is small, which means that if $\varphi$ is chosen large enough, $|g_j| < \varphi$ with high probability for $j > 1$. Therefore, a single error can be corrected with high probability using the suggested algorithm.

■ 
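
The argument in the proof can be illustrated numerically. The sketch below assumes a dense Gaussian random matrix W as a stand-in for independently learned weight vectors (an assumption for this example only). Since the stored pattern satisfies the constraints, only the noise vector z contributes to the input sums.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 20
W = rng.standard_normal((m, n))              # assumed random weights, not learned

z = np.zeros(n)
z[0] = 1.0                                   # single positive error on neuron 1
h = W @ z                                    # equals the first column of W
y = -np.sign(h)                              # forward iteration of Algorithm 2
g = (W.T @ y) / np.sum(np.abs(W), axis=0)    # feedback to every pattern neuron

feedback_corrupted = g[0]
largest_other = np.max(np.abs(g[1:]))
```

The corrupted neuron receives feedback of magnitude one, while the feedback to the other neurons is a normalized sum of terms with essentially random signs and therefore stays well below a threshold such as 0.8.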

Given that each local network is able to correct one error, L such networks can correct L input errors if they are separated such that only one error appears in the input of each local network. Otherwise, there would be a probability that the network could not handle the errors. In that case, we feed the overall pattern of length n to the second layer with the connectivity matrix $W_g$, which enforces $m_g$ global constraints. And since the probability of correcting two erroneous nodes increases with the input size, we expect to have a better error correction probability in the second layer. Therefore, using this simple scheme we expect to gain a lot in correcting errors in the patterns. In the next section, we provide simulation results which confirm our expectations and show that the block error rate can be improved by a factor of 100 in some cases.
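
The two-level scheme can be sketched as follows. This is a simplified illustration: the global stage is always run after the local stages (it leaves the pattern unchanged when all global constraints hold), the hand-built matrices stand in for learned ones, and the feedback formula is the normalized sum used in the proof of Theorem 3. All names are hypothetical.

```python
import numpy as np

def error_correct(W, x, phi=0.8, t_max=20, eps=1e-2):
    """Compact sketch of the bit-flipping style correction for one network."""
    x = np.array(x, dtype=float)
    col_l1 = np.maximum(np.sum(np.abs(W), axis=0), 1e-12)
    for _ in range(t_max):
        h = W @ x
        y = np.where(h > eps, -1.0, np.where(h < -eps, 1.0, 0.0))
        g = (W.T @ y) / col_l1
        flip = np.abs(g) > phi
        if not flip.any():
            break
        x[flip] += np.sign(g[flip])
    return x

def two_level_recall(W_locals, W_global, x):
    """Correct each sub-pattern locally, then the whole pattern globally."""
    x = np.array(x, dtype=float)
    size = x.size // len(W_locals)
    for l, W_l in enumerate(W_locals):
        block = slice(l * size, (l + 1) * size)
        x[block] = error_correct(W_l, x[block])
    return error_correct(W_global, x)

# Local constraints: entries inside a block are equal (cyclic differences).
W_local = np.array([[1, -1, 0, 0], [0, 1, -1, 0],
                    [0, 0, 1, -1], [-1, 0, 0, 1]], dtype=float)
# Global constraints: entry j of block 1 equals entry j of block 2.
W_global = np.hstack([np.eye(4), -np.eye(4)])

x_clean = np.full(8, 3.0)
x_noisy = x_clean.copy()
x_noisy[0] += 1                              # one error in the first block
x_noisy[5] -= 1                              # one error in the second block
x_hat = two_level_recall([W_local, W_local], W_global, x_noisy)
```

With one error per block, each local network sees a single-error problem and the two errors are removed before the global stage runs.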

A. Some remarks 

First of all, one should note that the above method only 
works if there is some redundancy at the global level as well. 
If the set of weight matrices $W_1, \dots, W_L$ define completely
separate sub-spaces in the n/L-dimensional space, then for 
sure we gain nothing using this method. 

Secondly, there is no need to have the dimension of the 
subspaces to be equal to each other. We can have different 
lengths for the sub-patterns belonging to each subspace and 
different number of constraints for that particular sub-space. 
This gives us more degrees of freedom as well since we can
spend some time to find the optimal length of each sub-pattern 
for a particular training data set. 

Thirdly, the number of constraints for the second layer 
affects the gain one obtains in the error performance. Intu- 
itively, if the number of global constraints is large, we are 
enforcing more constraints so we expect obtaining a better 
error performance. We can think of determining the number 
even adaptively, i.e. if the error performance that we are getting 
is unacceptable, we can look deeper in patterns to identify 
their internal structure by searching for more constraints. This 
would be a subject of our future research. 

V. Simulation Results 

We have simulated the proposed learning algorithm in the 
multi-level architecture to investigate the block error rate of 
the suggested approach and the gain we obtain in error rates 
by adding a second level. We constructed 4 local networks,
each with n/4 pattern nodes and m constraint nodes.

A. Learning Phase 

We generated a sample data set of C = 10000 patterns of length n where each block of n/4 belonged to a subspace of dimension k < n/4. Note that C can be an exponential number in n. However, we selected C = 10000 as an example to show the performance of the algorithm because even for small values of k, an exponential number in k will become too large to handle numerically. The result of the learning algorithm is four different local connectivity matrices $W_1, \dots, W_4$ as well as a global weight matrix $W_g$. The number of local constraints was $m = n/4 - k$ and the number of global constraints was $m_g = n - k_t$, where $k_t$ is the dimension of the subspace for the
overall pattern. The learning steps are run until 99% of the patterns in the training set have converged. Table I summarizes the other simulation parameters. For cases where $\lambda_t = 0$, $\alpha_t$ was fixed to 0.49.

TABLE I
Simulation parameters

Parameter    $\delta$    $\theta_t$    $\alpha_t$ (when $\lambda_t \neq 0$)    $\epsilon$    $p$
Value        10          0.25/t        min(?, 1 + $\lambda_t$)                 0.01          0.01/$\|X\|_2$



Table II shows the average number of iterations executed before convergence is reached for different constraint nodes at the local and global level. It also gives the average sparsity ratio for the columns of matrix W. The sparsity ratio is defined as $\rho = \kappa/n$, where $\kappa$ is the number of non-zero elements. From the table one notices that as n increases, the vectors become sparser.
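
The sparsity ratio can be computed directly from a weight vector. The helper below is a small illustration. The optional tolerance `tol` is an assumption added for vectors whose near-zero entries should count as zero.

```python
import numpy as np

def sparsity_ratio(w, tol=0.0):
    """Fraction of entries of w whose magnitude exceeds tol."""
    w = np.asarray(w, dtype=float)
    return np.count_nonzero(np.abs(w) > tol) / w.size

rho = sparsity_ratio([0.0, 1.5, 0.0, -2.0])
```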

TABLE II
Average number of convergence iterations and sparsity in the local and global networks for n = 400

            Sparsity Ratio                 Convergence Rate
            k_t = 100     k_t = 200        k_t = 100     k_t = 200
Local       0.28          0.32             4808          5064
Global      0.22          0.26             14426         33206


B. Recall Phase 

For the recall phase, in each trial we pick a pattern randomly 
from the training set, corrupt a given number of its symbols 
with ±1 noise and use the suggested algorithm to correct 
the errors. As mentioned earlier, the errors are corrected first at the local and then at the global level. When finished, we compare the output of the first and the second level with the original (uncorrupted) pattern x. A pattern error is declared if the output does not match at each stage. Table III shows the simulation parameters in the recall phase.



TABLE III
Simulation parameters

Parameter    $\varphi$    $t_{\max}$       $\epsilon$
Value        0.8          $20\|z\|_0$      0.01


Figure 3 illustrates the pattern error rates for n = 400 with two different values of $k_t = 100$ and $k_t = 200$. The results are also compared to that of the bit-flipping algorithm in [2] to show the improved performance of the proposed algorithm. As one can see, having a larger number of constraints at the global level, i.e. having a smaller $k_t$, will result in better pattern error rates at the end of the second stage. Furthermore, note that since we stop the learning after 99% of the patterns have been learned, it is natural to see some recall errors even for 1 initial erroneous node.




Fig. 3. Pattern error rate against the initial number of erroneous nodes 



Table IV shows the gain we obtain by adding a second level to the network architecture. The gain is calculated as the ratio between the pattern error rate at the output of the first layer and the pattern error rate at the output of the second layer.

TABLE IV
Gain in Pattern Error Rate (PER) for n = 400 and different initial numbers of errors

Number of initial errors    Gain for k_t = 100    Gain for k_t = 200
2                           10.2                  2.79
3                           6.22                  2.17
4                           4.58                  1.88
5                           3.55                  1.68


VI. Future Works 

In order to extend the multi-level neural network, we must first find a way to generate patterns that belong to a subspace with dimension $nL - m_g$, where $m_g$ lies within the bounds $L(n - k) < m_g < nL - k$. This will give us a way to investigate the trade-off between the maximum number of memorizable patterns and the degree of error correction possible.

Furthermore, so far we have assumed that the second level enforces constraints in the same space. However, it is possible that the second level imposes a set of constraints in a totally different space. For this purpose, we need a mapping from one space into another. A good example is the written language. While there are local constraints on the spelling of the words, there are some constraints enforced by the grammar or the overall meaning of a sentence. The latter constraints are not on the space of letters but rather the space of grammar or meaning. Therefore, in order to, for instance, correct an error in the word _at, we can replace _ with either h, to get hat, or c, to get cat. Without any other clue, we cannot find the correct answer. However, let's say we have the sentence "The _at ran away". Then from the constraints in the space of meaning we know that the subject must be an animal or a person. Therefore, we can return cat as the correct answer. Finding a proper mapping is the subject of our future work.